# Lightweight VLM

Smolvlm 500M Anime Caption V0.2
Apache-2.0
A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base.
Image-to-Text English
Andres77872
17
0
Smolvlm 500M Anime Caption V0.1
Apache-2.0
A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base, trained on 180K synthetic image/caption pairs generated by large language models.
Image-to-Text English
Andres77872
61
0
Granite Vision 3.2 2b
Apache-2.0
granite-vision-3.2-2b is a compact and efficient vision-language model specifically designed for visual document understanding, capable of automatically extracting content from tables, charts, infographics, and more.
Image-to-Text Transformers English
unsloth
43
1
Paligemma 3b Ft Science Qa 448
PaliGemma is a 3B-parameter lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, that takes image and text inputs and generates text outputs.
Image-to-Text Transformers
google
15
2
Paligemma 3b Pt 448
PaliGemma is a lightweight and versatile vision-language model built on the SigLIP vision model and Gemma language model, supporting multilingual image-text interaction tasks.
Image-to-Text Transformers
google
2,708
29
Paligemma 3b Pt 896
PaliGemma is a versatile lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs. It has multilingual capabilities.
Image-to-Text Transformers
google
1,788
119
Paligemma 3b Mix 448
PaliGemma is a versatile lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs to generate text outputs.
Image-to-Text Transformers
google
5,488
109
Paligemma 3b Ft Docvqa 896
PaliGemma is a lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.
Image-to-Text Transformers
google
519
9
Paligemma 3b Ft Refcoco Seg 896
PaliGemma is a lightweight vision-language model developed by Google, built upon the SigLIP vision model and Gemma language model, supporting multilingual text generation and visual understanding tasks.
Image-to-Text Transformers
google
20
6
Paligemma 3b Mix 224
PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.
Image-to-Text Transformers
google
143.03k
75
Paligemma 3b Pt 224
PaliGemma is a versatile lightweight vision-language model (VLM) built upon the SigLIP vision model and the Gemma language model, capable of processing both image and text inputs to generate text outputs.
Image-to-Text Transformers
google
38.40k
318
Paligemma 3b Ft Vqav2 448
PaliGemma is a lightweight vision-language model developed by Google, combining image understanding and text generation capabilities, supporting multilingual tasks.
Image-to-Text Transformers
google
121
17
Paligemma 3b Ft Ocrvqa 448
PaliGemma is a versatile lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and Gemma language model, supporting both image and text inputs with text outputs.
Image-to-Text Transformers
google
365
6
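The trailing number in each PaliGemma name (224, 448, 896) is the square input image resolution in pixels. As a rough sketch, assuming square inputs and the 14-pixel patch size of PaliGemma's SigLIP encoder, the number of image tokens handed to the Gemma language model grows with the square of the resolution. `image_tokens` below is a hypothetical helper for illustration, not part of any library.

```python
# Hypothetical helper: count the image (patch) tokens a SigLIP-style encoder
# produces for a square image, assuming non-overlapping 14x14-pixel patches.
PATCH_SIZE = 14  # assumed SigLIP patch size used by PaliGemma


def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Return the number of patch tokens for a square image of side `resolution`."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    # One token per patch: patches along the width times patches along the height.
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    for res in (224, 448, 896):
        print(res, image_tokens(res))
    # 224 256
    # 448 1024
    # 896 4096
```

Doubling the resolution quadruples the token count, which is one reason the 896 variants are typically reserved for detail-heavy tasks such as document VQA and segmentation.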
© 2025 AIbase